feat(webapp,run-engine): mollifier drainer replay + stale sweep + cancelled-run engine API #3754

Draft
d-cs wants to merge 5 commits into mollifier-phase-3-trigger from mollifier-phase-3-replay

Conversation

@d-cs
Collaborator

@d-cs d-cs commented May 26, 2026

Summary

The replay side of the mollifier:

  • DrainerHandler: reads buffered snapshots and replays them through engine.trigger to materialise PG rows.
  • RunEngine.createCancelledRun: new public method the handler uses to write CANCELED rows directly from snapshots (bypass queue + waitpoint, emit runCancelled). Tolerates the cjson empty-table tags edge case found during validation.
  • Drainer fairness: org → env rotation so a heavy env doesn't starve light ones in the same org.
  • Stale-entry sweep + telemetry + alertable gauge so a stuck/offline drainer surfaces in alerts.
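
The org → env rotation in the fairness bullet can be illustrated with a small in-memory sketch. This is not the drainer's actual implementation (which is not shown in this conversation); `BufferedEntry`, `fairDrainOrder`, and the queue shapes are hypothetical names chosen for illustration. It assumes one entry is taken per turn, rotating first across orgs and then across envs within each org.

```typescript
// Hypothetical shape of a buffered entry; the real snapshot type is not shown in this PR.
interface BufferedEntry {
  orgId: string;
  envId: string;
  runId: string;
}

interface EnvQueue {
  envId: string;
  items: BufferedEntry[];
}

interface OrgQueue {
  orgId: string;
  envs: EnvQueue[];
}

// Returns entries in a fair drain order: one entry per turn, rotating
// across orgs, and within each org rotating across its envs.
function fairDrainOrder(entries: BufferedEntry[]): BufferedEntry[] {
  // Group entries into org -> env -> FIFO queue, preserving arrival order.
  const orgs: OrgQueue[] = [];
  for (const entry of entries) {
    let org = orgs.find((o) => o.orgId === entry.orgId);
    if (!org) {
      org = { orgId: entry.orgId, envs: [] };
      orgs.push(org);
    }
    let env = org.envs.find((e) => e.envId === entry.envId);
    if (!env) {
      env = { envId: entry.envId, items: [] };
      org.envs.push(env);
    }
    env.items.push(entry);
  }

  const out: BufferedEntry[] = [];
  // Invariant: every queued org has at least one env, every queued env has at least one item.
  while (orgs.length > 0) {
    const org = orgs.shift()!;
    const env = org.envs.shift()!;
    out.push(env.items.shift()!);
    // Re-queue at the back so the next turn goes to a different env / org.
    if (env.items.length > 0) org.envs.push(env);
    if (org.envs.length > 0) orgs.push(org);
  }
  return out;
}
```

With three entries buffered for a heavy env and one for a light env in the same org, the light env's entry is drained on that org's second turn instead of waiting behind all three heavy entries.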

Both the drainer and the sweep are off by default; nothing fires unless the corresponding flag is set (TRIGGER_MOLLIFIER_DRAINER_ENABLED, TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED).
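
A minimal sketch of what default-off gating could look like. The two variable names come from the description above; the helper names and the accepted values (`"1"` or `"true"`) are assumptions, since the webapp's actual env parsing is not shown here.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: only "1" or "true" turn a flag on; anything else, including unset, is off.
function isFlagOn(env: Env, name: string): boolean {
  const raw = env[name];
  return raw === "1" || raw === "true";
}

// Hypothetical helper that resolves both mollifier flags from an env map.
function mollifierFlags(env: Env): { drainerEnabled: boolean; staleSweepEnabled: boolean } {
  return {
    drainerEnabled: isFlagOn(env, "TRIGGER_MOLLIFIER_DRAINER_ENABLED"),
    staleSweepEnabled: isFlagOn(env, "TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED"),
  };
}
```

Passing the env map as a parameter, rather than reading a global, keeps the sketch easy to test with an empty object for the default-off case.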

Stacked on the trigger-time decisions PR.

Test plan

  • `pnpm run typecheck --filter webapp` passes
  • `pnpm run test --filter webapp test/mollifierDrainerHandler.test.ts` passes
  • `pnpm run test --filter webapp test/mollifierStaleSweep.test.ts` passes
  • `pnpm run test --filter @internal/run-engine src/engine/tests/createCancelledRun.test.ts` passes
  • `pnpm run test --filter @trigger.dev/redis-worker packages/redis-worker/src/mollifier/drainer.test.ts` passes

@changeset-bot

changeset-bot Bot commented May 26, 2026

⚠️ No Changeset found

Latest commit: 242ba73

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai
Contributor

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 89623433-4e39-4bea-9e9f-fbaf302aad06

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment thread apps/webapp/app/v3/mollifier/mollifierDrainerHandler.server.ts
Comment thread apps/webapp/app/v3/mollifier/mollifierTelemetry.server.ts
Comment thread apps/webapp/test/mollifierStaleSweep.test.ts Outdated
Comment thread internal-packages/run-engine/src/engine/index.ts
@d-cs d-cs self-assigned this May 26, 2026
@d-cs force-pushed the mollifier-phase-3-trigger branch from 626a8dc to af7368e on May 26, 2026 11:12
@d-cs force-pushed the mollifier-phase-3-replay branch from 31f4726 to b05929b on May 26, 2026 11:12
@d-cs force-pushed the mollifier-phase-3-trigger branch from 5a7bc19 to baa6f17 on May 26, 2026 13:24
@d-cs force-pushed the mollifier-phase-3-replay branch from b05929b to b89da52 on May 26, 2026 13:24
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.

Comment thread apps/webapp/app/entry.server.tsx Outdated
Comment thread internal-packages/run-engine/src/engine/index.ts
d-cs and others added 4 commits May 26, 2026 17:20
…celled-run engine API

The replay side of the mollifier:
- DrainerHandler that reads buffered snapshots and replays them
  through engine.trigger to materialise PG rows.
- RunEngine.createCancelledRun: new public method the handler uses to
  write CANCELED rows directly from snapshots (bypass queue +
  waitpoint, emit runCancelled). Tolerates cjson empty-table tags.
- Drainer fairness: org → env rotation so a heavy env doesn't starve
  light ones in the same org.
- Stale-entry sweep + telemetry + alertable gauge for stuck drainers.

Both drainer and sweep default-off; nothing fires unless flagged on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `isRetryablePgError`: also accept `errorCode === "P1001"` so
  `PrismaClientInitializationError` (which surfaces P1001 on a
  different field than `PrismaClientKnownRequestError`) retries.
- Drop `envId` from OTel metric labels on
  `mollifier.realtime_subscriptions.buffered`,
  `mollifier.stale_entries`, and the
  `mollifier.stale_entries.current` gauge. `envId` is a banned
  high-cardinality attribute; the structured warn log alongside each
  counter tick still carries envId for forensic drill-down.
- Stale-sweep test name + comments now match the assertion shape
  (all three entries stale, not "two stale + one fresh").
- `RunEngine.createCancelledRun` P2002 path now requires the existing
  row's status to be CANCELED; a non-canceled conflict throws rather
  than silently reporting success, so the caller can route to
  `engine.cancelRun()` or skip.
- Regression test pins the new conflict guard.
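
The first bullet's retry check can be sketched as follows. This is an illustration under assumptions, not the function from the diff: it checks only P1001 and uses a structural type instead of importing Prisma's error classes, so any other codes the real `isRetryablePgError` accepts are not represented.

```typescript
// Sketch only: treats an error as retryable when it carries P1001
// (Prisma's "can't reach database server" code) on either field.
// `code` is where PrismaClientKnownRequestError exposes it;
// `errorCode` is where PrismaClientInitializationError exposes it.
function isRetryablePgError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { code?: unknown; errorCode?: unknown };
  return candidate.code === "P1001" || candidate.errorCode === "P1001";
}
```

Checking both fields means a connection failure during client initialization is retried the same way as one surfaced on a query.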

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
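
The P2002 conflict guard described in this commit can be sketched against an in-memory store. Everything here is hypothetical scaffolding (`RunStore`, `UniqueViolation`, the `RunStatus` values other than CANCELED, and the function signature); only the rule itself comes from the commit message: a unique-constraint conflict counts as success only when the existing row is already CANCELED.

```typescript
type RunStatus = "PENDING" | "EXECUTING" | "CANCELED";

interface RunRow {
  id: string;
  status: RunStatus;
}

// Stand-in for a Prisma unique-constraint error (code P2002).
class UniqueViolation extends Error {
  code = "P2002";
}

// Hypothetical in-memory table with a unique constraint on `id`.
class RunStore {
  private rows = new Map<string, RunRow>();

  insert(row: RunRow): void {
    if (this.rows.has(row.id)) throw new UniqueViolation(`duplicate run ${row.id}`);
    this.rows.set(row.id, row);
  }

  find(id: string): RunRow | undefined {
    return this.rows.get(id);
  }
}

// Writes a CANCELED row. On a unique conflict, succeeds only if the
// existing row is already CANCELED; otherwise throws so the caller can
// decide whether to cancel the live run or skip.
function createCancelledRun(store: RunStore, id: string): RunRow {
  try {
    const row: RunRow = { id, status: "CANCELED" };
    store.insert(row);
    return row;
  } catch (err) {
    if (!(err instanceof UniqueViolation)) throw err;
    const existing = store.find(id);
    if (existing && existing.status === "CANCELED") return existing; // idempotent replay
    throw new Error(`run ${id} already exists with status ${existing?.status ?? "unknown"}`);
  }
}
```

Replaying the same snapshot twice is therefore harmless, while a conflict with a run that is still live surfaces as an error instead of being reported as success.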
…leton

Importing the production drainer wiring transitively loads
`~/v3/runEngine.server`, whose top-level `singleton(...)` eagerly
constructs a RunEngine. The constructor spins up Prisma + Redis
workers that try to connect to localhost. In CI (no PG, no Redis)
that produces an unhandled `PrismaClientInitializationError` which
fails the run even though every assertion passes. Mock the runEngine
and prisma modules so the unit test exercises only the bootstrap's
error classification, not a live engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container startup + the sweep loop can exceed Vitest's 5s default on
CI runners (passes in ~1.7-2s locally). Matches the explicit
`{ timeout: 20_000 }` other mollifier redisTests carry across the
project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-cs force-pushed the mollifier-phase-3-replay branch from 74fdf6d to c6fa61f on May 26, 2026 16:20
Two bugs flagged by Devin on PR #3754:

1. entry.server.tsx reverted to `void sessionsReplicationInstance;`,
   which esbuild tree-shakes under `"sideEffects": false`. Restored
   the globalThis assignment + warning comment from #3738 (incident
   TRI-9864). Without this the sessions→ClickHouse logical replication
   slot stops being consumed at boot.

2. createFailedTaskRun unconditionally emitted `runFailed`, which
   the `completeFailedRunEvent` listener uses to write a span
   completion into ClickHouse. But TriggerFailedTaskService.call()
   already wraps createFailedTaskRun inside
   `repository.traceEvent({ incomplete: false, isError: true })`
   which writes its own completion row for the same (traceId, spanId).
   Two completions racing on the same span row is a real
   observability bug.

   Added an `emitRunFailedEvent: boolean = true` opt-out. The
   TriggerFailedTaskService.call() path now passes `false` and
   enqueues `PerformTaskRunAlertsService` directly after the trace
   event closes so the alerts side of `runFailed` is preserved.
   `callWithoutTraceEvents` and the mollifier drainer's
   terminal-failure path keep the default emit (they have no outer
   trace event managing the span).

   Regression test pins the opt-out: `emitRunFailedEvent: false`
   writes the PG row but does NOT fire the bus event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
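
The `emitRunFailedEvent` opt-out from the second fix can be sketched with a tiny event bus. `TinyBus`, the `rows` array standing in for the PG write, and the option-object shape are hypothetical; only the parameter name and its `true` default are taken from the commit message.

```typescript
type RunFailedListener = (runId: string) => void;

// Minimal stand-in for the engine's event bus.
class TinyBus {
  private listeners: RunFailedListener[] = [];

  onRunFailed(listener: RunFailedListener): void {
    this.listeners.push(listener);
  }

  emitRunFailed(runId: string): void {
    for (const listener of this.listeners) listener(runId);
  }
}

interface CreateFailedRunOptions {
  runId: string;
  // Defaults to true; callers that already record the span completion pass false.
  emitRunFailedEvent?: boolean;
}

// Always writes the row; emits runFailed only when not opted out.
function createFailedTaskRun(
  bus: TinyBus,
  rows: string[], // stands in for the PG table
  { runId, emitRunFailedEvent = true }: CreateFailedRunOptions
): void {
  rows.push(runId);
  if (emitRunFailedEvent) bus.emitRunFailed(runId);
}
```

A caller wrapped in its own trace event would pass `emitRunFailedEvent: false` and handle alerting itself, while callers with no outer trace event keep the default and let the listener write the completion.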